Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)

Author

  • Jon Stearley
Abstract

The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper seeks to foster a common basis for communication about supercomputer RAS by proposing a system state model, definitions, and measurements. These are modeled after the SEMI-E10 specification [1], which is widely used in the semiconductor manufacturing industry.


Similar resources

Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems

This paper presents various aspects of reliability, availability and serviceability (RAS) systems as they relate to group communication service, including reliable and total order multicast/broadcast, virtual synchrony, and failure detection. While availability, particularly high availability using replication-based architectures, has recently attracted a surge of research interest, mu...


Towards a Specification for Measuring Red Storm Reliability, Availability, and Serviceability (RAS)

The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved, hinders their solution, and increases total system cost. Seeking to foster a common basis for communication about supercomputer RAS, [1] proposed a general system state model, definitions, and measurements based on the SEMI-E10 specification [2] used in the semiconductor ma...


Reliability, availability, and serviceability (RAS) of the IBM eServer z990

M. L. Fair, C. R. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, N. E. Weber

The IBM eServer zSeries Model z990 offers customers significant new opportunity for server growth while preserving and enhancing server availability. The z990 provides vertical growth capability by introducing the concurrent additio...


Measuring Fault Tolerance Overhead in Multi-Run Scientific Computations

Knowing the beneficial or productive usage time for large high performance computing (HPC) platforms is important for computing metrics that capture the reliability, availability, and serviceability (RAS) of the platform. Currently, application execution time is generally accounted as all productive system time, yet large-scale, long-running applications incur fault tolerance overhead such as c...




Publication date: 2005